1791 Commits

Author SHA1 Message Date
Rob Howie ff7b81dc8f fix: add health check to detect and recover stale EphemeralRunner registrations

## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.

This also fixes `deleteRunnerFromService` to tolerate a 404 from `RemoveRunner`, since removing a runner whose registration was already invalidated also returns 404. Previously, that 404 caused `markAsFailed` to return an error and trigger a requeue loop, which prevented cleanup of stale runners.

A redundant pod phase check was removed from the health check path. `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch (container confirmed running), so the phase check was unnecessary. It could also cause false negatives, since the local `EphemeralRunner` object may not have the updated phase yet.
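
To illustrate the decision rule described in this summary, here is a minimal, self-contained Go sketch. It is not the controller code: `APIError`, `RunnerGetter`, `registrationHealthy`, `stubClient`, and `healthyWith` are hypothetical names introduced only for this example, and the real `GetRunner()` signature and error type may differ. The sketch only shows the shape of the logic: a confirmed 404 means unhealthy, and any other error is logged and ignored.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// APIError is a hypothetical error type that carries the HTTP status
// returned by the Actions API. The real client's error type may differ.
type APIError struct {
	StatusCode int
}

func (e *APIError) Error() string {
	return fmt.Sprintf("actions API error: status %d", e.StatusCode)
}

// RunnerGetter is a hypothetical, narrowed view of the Actions client.
// Only the error result matters for this sketch.
type RunnerGetter interface {
	GetRunner(ctx context.Context, runnerID int64) error
}

// registrationHealthy returns false only when the API confirms that the
// registration is gone (404). Every other error is treated as transient.
func registrationHealthy(ctx context.Context, c RunnerGetter, runnerID int64) bool {
	err := c.GetRunner(ctx, runnerID)
	if err == nil {
		return true
	}
	var apiErr *APIError
	if errors.As(err, &apiErr) && apiErr.StatusCode == http.StatusNotFound {
		return false
	}
	// Transient failure (500, timeout, ...): log it and keep the runner.
	fmt.Printf("ignoring transient error for runner %d: %v\n", runnerID, err)
	return true
}

// stubClient is a hypothetical test double that always returns err.
type stubClient struct{ err error }

func (s stubClient) GetRunner(context.Context, int64) error { return s.err }

// healthyWith runs the check against a stub that returns err.
func healthyWith(err error) bool {
	return registrationHealthy(context.Background(), stubClient{err: err}, 42)
}

func main() {
	fmt.Println(healthyWith(nil))                        // true
	fmt.Println(healthyWith(&APIError{StatusCode: 404})) // false
	fmt.Println(healthyWith(&APIError{StatusCode: 500})) // true (after logging the ignored error)
	fmt.Println(healthyWith(errors.New("timeout")))      // true (after logging the ignored error)
}
```

Treating everything except a confirmed 404 as healthy keeps the check conservative: an API outage alone should never cause runners to be marked failed.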

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases — one verifying a 404 from `GetRunner` marks the runner as failed, another verifying a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.
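
The removal path and the fake-client option listed above could look roughly like the following sketch. The option name `WithRemoveRunnerError` comes from this PR; everything else (`StatusError`, `RunnerRemover`, `deleteRunner`, `FakeClient`, `NewFakeClient`, `removeWith`) is a hypothetical stand-in written for illustration, not the repository's actual types or signatures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
)

// StatusError is a hypothetical error type carrying an HTTP status code.
type StatusError struct{ StatusCode int }

func (e *StatusError) Error() string {
	return fmt.Sprintf("unexpected status %d", e.StatusCode)
}

// RunnerRemover is a hypothetical, narrowed view of the Actions client.
type RunnerRemover interface {
	RemoveRunner(ctx context.Context, runnerID int64) error
}

// deleteRunner treats a 404 from RemoveRunner as success, because the
// runner is already gone. All other errors are returned to the caller.
func deleteRunner(ctx context.Context, c RunnerRemover, runnerID int64) error {
	err := c.RemoveRunner(ctx, runnerID)
	if err == nil {
		return nil
	}
	var statusErr *StatusError
	if errors.As(err, &statusErr) && statusErr.StatusCode == http.StatusNotFound {
		return nil
	}
	return fmt.Errorf("failed to remove runner %d: %w", runnerID, err)
}

// FakeClient is a hypothetical fake configured through functional options.
type FakeClient struct{ removeRunnerErr error }

// Option mutates a FakeClient during construction.
type Option func(*FakeClient)

// WithRemoveRunnerError makes RemoveRunner return err.
func WithRemoveRunnerError(err error) Option {
	return func(f *FakeClient) { f.removeRunnerErr = err }
}

// NewFakeClient builds a FakeClient and applies the given options in order.
func NewFakeClient(opts ...Option) *FakeClient {
	f := &FakeClient{}
	for _, opt := range opts {
		opt(f)
	}
	return f
}

func (f *FakeClient) RemoveRunner(context.Context, int64) error {
	return f.removeRunnerErr
}

// removeWith runs deleteRunner against a fake whose RemoveRunner returns err.
func removeWith(err error) error {
	return deleteRunner(context.Background(), NewFakeClient(WithRemoveRunnerError(err)), 42)
}

func main() {
	fmt.Println(removeWith(&StatusError{StatusCode: 404}) == nil) // true
	fmt.Println(removeWith(&StatusError{StatusCode: 500}) == nil) // false
}
```

Returning `nil` on 404 makes removal idempotent, so a caller such as `markAsFailed` can proceed with cleanup instead of requeueing.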

## Issues Fixed

### Directly fixes

- **#4396** — *ARC runners did not recover automatically after GitHub Outage*: Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification — on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.

- **#4395** — *GHA Self-hosted runner pods are running but the runner status is showing offline*: Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **#4397** — *Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*: This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1) — by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.

- **#4307** — *Ephemeral Runners seem to get stuck when job is canceled or interrupted*: Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.

- **#4155** — *EphemeralRunner and its pods left stuck Running after runner OOMKill*: Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism — if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.

- **#3821** — *AutoscalingRunnerSet gets stuck thinking runners exist when they do not*: Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking — verify stale runners are detected and replaced
2026-03-09 09:31:17 +00:00
github-actions[bot] 396ee88f5a
Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-03-03 01:02:40 +01:00
gateixeira 1f615c1a33
feat: add default linux nodeSelector to listener pod (#4377)
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2026-02-24 17:56:39 +01:00
Nikola Jokic 8b7fd9ffef
Switch client to scaleset library for the listener and update mocks (#4383) 2026-02-24 14:17:31 +01:00
Nikola Jokic c6e4c94a6a
Fix tests and generate mocks (#4384) 2026-02-24 13:36:01 +01:00
dhawalseth 9de09f56eb
Include the HTTP status code in jit error (#4361)
Co-authored-by: Dhawal Seth <dseth@linkedin.com>
2026-01-29 16:40:17 +01:00
Caius Durling 02aa70a64a
Fix `AcivityId` typo in error strings (#4359) 2026-01-21 01:14:26 +01:00
Jiaren Wu d3ca9de3ca
Potential fix for code scanning alert no. 7: Use of a broken or weak cryptographic hashing algorithm on sensitive data (#4353)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2026-01-14 21:04:02 -08:00
github-actions[bot] a868229fe0
Updates: runner to v2.331.0 (#4351)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-01-14 13:32:39 -05:00
Nikola Jokic a505fb5616
Prepare 0.13.1 release (#4341) 2025-12-23 14:57:05 +01:00
Nikola Jokic bfe78ccd5d
Make restart pod more flexible to different failure scenarios (#4340) 2025-12-19 15:49:42 +01:00
dependabot[bot] 3fd1048576
Bump golangci/golangci-lint-action from 9.1.0 to 9.2.0 in the actions group (#4335)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-16 14:15:21 +01:00
dependabot[bot] 180e0dabb2
Bump the gomod group across 1 directory with 10 updates (#4338)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-12-16 14:05:44 +01:00
Nikola Jokic 50038fba61
Re-schedule if the failed reason starts with `OutOf` (#4336) 2025-12-16 13:26:44 +01:00
Nikola Jokic 82d5579696
Restart the listener if pod is evicted (#4332) 2025-12-09 17:55:09 +01:00
Nikola Jokic 540269880f
Typo in test name caused test to not execute (#4330) 2025-11-27 15:31:57 +01:00
dependabot[bot] 9ebb97fe2e
Bump the actions group with 3 updates (#4328)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-11-25 12:00:40 +01:00
Nikola Jokic 75c401f6c1
Remove old e2e tests (#4325) 2025-11-25 00:37:32 +01:00
dependabot[bot] a9e371e083
Bump the actions group across 1 directory with 4 updates (#4309)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-11-21 19:23:19 +01:00
dependabot[bot] fdf78189ab
Bump golang.org/x/crypto from 0.43.0 to 0.45.0 (#4318)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-11-21 17:14:05 +01:00
Marcus Ramberg cac7a40b70
Add support for giving kubernetes mode scaleset service account additional permissions (#4282) 2025-11-21 15:56:08 +01:00
dependabot[bot] 837406ae01
Bump the gomod group across 1 directory with 11 updates (#4317)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikola Jokic <jokicnikola07@gmail.com>
2025-11-21 14:49:28 +01:00
Nikola Jokic 95d2107a6a
Code style changes on the controller (#4324) 2025-11-21 14:20:44 +01:00
github-actions[bot] 5a6bfc937a
Updates: runner to v2.330.0 (#4319)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-11-21 10:10:16 +01:00
Nikola Jokic 6d07b8d853
Add ephemeral runner finalizer during creation and check finalizer without requeue (#4320) 2025-11-20 23:06:27 +01:00
Nikola Jokic a50d8bfebc
e2e: move from deprecated openebs charts to new registry (#4321) 2025-11-20 22:25:52 +01:00
Nikola Jokic 138b39bfcb
Create e2e test suite (#3136)
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
2025-11-19 16:25:58 +01:00
Rafik Salama 4615321588
Upgrade Docker and Docker Compose to match GH hosted runner (#4312) 2025-11-13 11:31:17 +01:00
Nikola Jokic 9f9409a4c1
Handle resource quota on status forbidden by retrying (#4305)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-11-10 13:58:25 +01:00
Nikola Jokic 3d73636407
Use combination of namespace, GitHub URL, and runner group when hashing the listener name (#4299) 2025-11-10 13:58:16 +01:00
Nikola Jokic 722c6e9edd
Bump kubebuilder tools in the workflow (#4300) 2025-11-10 12:26:08 +00:00
Nikola Jokic dcb45f0617
Bump timeout for min runners workflow to 30s (#4306) 2025-11-10 12:01:58 +00:00
Jiaren Wu dbac55ca9e
Fix for code scanning alert no. 5: Workflow does not contain permissions (#4292)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-10-31 10:20:30 +01:00
github-actions[bot] 91d45d870a
Updates: runner to v2.329.0 container-hooks to v0.8.0 (#4279)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-10-30 10:32:22 +01:00
Nikola Jokic 4d22089978
Delete listener resources without requeueing on each call (#4289) 2025-10-29 13:01:00 +01:00
Nikola Jokic 8007b8af25
Fix first interaction action (#4290) 2025-10-29 12:49:39 +01:00
dependabot[bot] 0baa4f6b09
Bump github/codeql-action from 3 to 4 in the actions group (#4281)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-22 11:26:36 +02:00
Nikola Jokic a0c30df25b
Prepare 0.13.0 release (#4280) 2025-10-16 19:25:56 +02:00
dependabot[bot] 27d03ef2e2
Bump the gomod group across 1 directory with 4 updates (#4277)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-15 01:31:29 +02:00
Nikola Jokic 634e42c916
Bump all dependencies (#4266) 2025-10-14 13:24:25 +02:00
Jiaren Wu 6e46b42bf4
Potential fix for code scanning alert no. 1: Workflow does not contain permissions (#4274)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jiaren-wu <190862939+jiaren-wu@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-13 11:08:35 -07:00
Jiaren Wu 71ebdd9d3c
Potential fix for code scanning alert no. 3: Workflow does not contain permissions (#4273)
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
2025-10-13 10:38:14 -07:00
Berat Postalcioglu 7604c8361f
docs: fix broken Grafana dashboard JSON path (#4270) 2025-10-09 22:05:43 +02:00
Nikola Jokic 94a6f3cc3a
Ensure ephemeral runner is deleted from the service on exit != 0 (#4260) 2025-10-06 11:38:56 +02:00
Nikola Jokic e3ed1ba226
Introduce new kubernetes-novolume mode (#4250)
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-10-03 12:03:38 +02:00
dependabot[bot] 652bd99439
Bump the actions group across 1 directory with 5 updates (#4262)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-01 17:27:52 +02:00
Yusuke Kuoka f731873df9
Add workflow name and target labels (#4240) 2025-09-30 16:01:51 +02:00
Nikola Jokic 088e2a3a90
Remove ephemeral runner when exit code != 0 and is patched with the job (#4239) 2025-09-17 21:40:37 +02:00
Dennis Stone 2035e13724
Update CODEOWNERS to include new maintainer (#4253) 2025-09-17 21:33:38 +02:00
Nikola Jokic 04b966dfec
Update CODEOWNERS (#4251) 2025-09-17 17:49:12 +02:00